Cost-Sensitive Learning of Deep Feature 
Representations from Imbalanced Data 

S. H. Khan, M. Hayat, M. Bennamoun, F. Sohel and R. Togneri 


Abstract: Class imbalance is a common problem in real-world
object detection and classification tasks. Data of some
classes is abundant, making them an over-represented majority,
and data of other classes is scarce, making them an
under-represented minority. This imbalance makes it challenging for
a classifier to appropriately learn the discriminating boundaries
of the majority and minority classes. In this work, we propose a
cost-sensitive deep neural network which can automatically learn
robust feature representations for both the majority and minority
classes. During training, our learning procedure jointly optimizes
the class-dependent costs and the neural network parameters.
The proposed approach is applicable to both binary and
multi-class problems without any modification. Moreover, as opposed
to data level approaches, we do not alter the original data
distribution, which results in a lower computational cost during
the training process. We report the results of our experiments
on six major image classification datasets and show that the
proposed approach significantly outperforms the baseline
algorithms. Comparisons with popular data sampling techniques and
cost-sensitive classifiers demonstrate the superior performance of
our proposed method.

Index Terms: Cost-sensitive learning, Convolutional Neural
Networks, Data Imbalance, Loss functions.


I. Introduction 

In most real-world classification problems, the collected
data follows a long tail distribution, i.e., data for a few object
classes is abundant while data for others is scarce. This
behaviour is termed the ‘class-imbalance problem’ and it is
inherently manifested in nearly all of the collected image
classification databases (e.g., Fig. 1). A multi-class dataset is
said to be ‘imbalanced’ or ‘skewed’ if some of its (minority)
classes, in the training set, are heavily under-represented
compared to other (majority) classes. This skewed distribution of
class instances forces the classification algorithms to be biased
towards the majority classes. As a result, the characteristics of
the minority classes are not adequately learned.

The class imbalance problem is of particular interest in
real-world scenarios, where it is essential to correctly classify
examples from an infrequent but important minority class.
For instance, a particular cancerous lesion (e.g., a melanoma)
which appears rarely during dermoscopy should not be
misclassified as benign (see Sec. IV). Similarly, for a continuous
surveillance task, a dangerous activity which occurs only
occasionally should still be detected by the monitoring system. The
same applies to many other application domains, e.g., object
classification, where the correct classification of a minority
class sample is as important as the correct classification
of a majority class sample. It is therefore required to enhance
the overall accuracy of the system without unduly sacrificing
the precision of any of the majority or minority classes. Most
classification algorithms try to minimize the overall
classification error during the training process. They therefore
implicitly assign an identical misclassification cost to all types
of errors, assuming their equivalent importance. As a result,
the classifier tends to correctly classify and favour the more
frequent classes.

Despite the pertinence of the class imbalance problem to
practical computer vision, there have been very few research
works on this topic in recent years. Class imbalance is
avoided in nearly all competitive datasets during the
evaluation and training procedures (see Fig. 1). For instance,
for the case of the popular image classification datasets
(such as CIFAR-10/100, ImageNet, Caltech-101/256, and
MIT-67), efforts have been made by the collectors to ensure
that either all of the classes have a minimum representation
with sufficient data, or that the experimental protocols are
reshaped to use an equal number of images for all classes
during the training and testing processes. This
approach is reasonable in the case of datasets with only
a few classes, which have an equal probability of appearing in
practical scenarios (e.g., digits in MNIST). However, with the
increasing number of classes in the collected object datasets, it
is becoming impractical to provide equal representations for all
classes in the training and testing subsets. For example, for
a fine-grained coral categorization dataset, endangered coral
species have a significantly lower representation compared to
the more abundant ones.

In this work, we propose to jointly learn robust feature
representations and classifier parameters, under a cost-sensitive
setting. This enables us not only to learn an improved classifier
that deals with the class imbalance problem, but also to extract
suitably adapted intermediate feature representations from a
deep Convolutional Neural Network (CNN). In this manner,
we directly modify the learning procedure to incorporate
class-dependent costs during training. In contrast, previous works
(such as [5], [6], [7], [8]) only readjust the training data distribution
to learn better classifiers. Moreover, unlike earlier cost-sensitive
methods, we do not use a handcrafted cost matrix, whose
design is based on expert judgement and becomes a tedious
task for a large number of classes. In our case, the
class-dependent costs are automatically set using data statistics
(e.g., data distribution and separability measures) during the
learning procedure. Another major difference with existing
techniques is that our class specific costs are only used during
the training process and, once the optimal CNN parameters are
learnt, predictions can be made without any modification to the
trained network. From this perspective, our approach can be
understood as a perturbation method, which forces the training
algorithm to learn more discriminative features. Nonetheless, it
is clearly different from the common perturbation mechanisms
used during training, e.g., data distortions, corrupted
features, affine transformations and activation dropout.

Fig. 1: Examples of popular classification datasets in which the number of images varies sharply across different classes. This class imbalance
poses a challenge for classification and representation learning algorithms.

Our contribution consists of the following: 1- We introduce
cost-sensitive versions of three widely used loss functions
for joint cost-sensitive learning of features and classifier
parameters in the CNN (Sec. III-C). We also show that the
improved loss functions have desirable properties such as
classification calibration and guess-aversion. 2- We analyse
the effect of these modified loss functions on the
back-propagation algorithm by deriving relations for propagated
gradients (Sec. III-E). 3- We propose an algorithm for joint
alternate optimization of the network parameters and the
class-sensitive costs (Sec. III-D). The proposed algorithm can
automatically work for both binary and multi-class classification
problems. We also show that the introduction of class-sensitive
costs does not significantly affect the training and testing time
of the original network (Sec. IV). 4- The proposed approach
has been extensively tested on six major classification datasets
and has been shown to outperform baseline procedures and
state-of-the-art approaches (Sec. IV-D).

The remainder of this paper is organized as follows. We
briefly discuss the related work in the next section. In Sec.
III-A and III-B we introduce our proposed approach and
analyse the modified loss functions in Sec. III-C. The learning
algorithm is then described in Sec. III-D and the CNN
implementation details are provided in Sec. IV-C. Experiments and
results are summarized in Sec. IV and the paper concludes in
Sec. V.


II. Related Work 

Previous research on the class imbalance problem has
concentrated mainly on two levels: the data level and the
algorithmic level. Below, we briefly discuss the different
research efforts that tackle the class imbalance problem.

Data level approaches manipulate the class
representations in the original dataset by either over-sampling the
minority classes or under-sampling the majority classes to
make the resulting data distribution balanced. However,
these techniques change the original distribution of the data
and consequently introduce drawbacks. While under-sampling
can potentially lose useful information about the majority
class data, over-sampling makes the training computationally
burdensome by artificially increasing the size of the training
set. Furthermore, over-sampling is prone to cause over-fitting
when exact copies of the minority class are replicated
randomly [5], [14].

To address the over-fitting problem, Chawla et al. [13]
introduced a method, called SMOTE, to generate new instances by
linear interpolation between closely lying minority class
samples. These synthetically generated minority class instances
may lie inside the convex hull of the majority class instances,
a phenomenon known as over-generalization. Over the years,
several variants of the SMOTE algorithm have been proposed
to solve this problem. For example, Borderline SMOTE
only over-samples the minority class samples which
lie close to the class boundaries. Safe-level SMOTE
carefully generates synthetic samples in the so-called
safe-regions, where the majority and minority class regions are not
overlapping. The local neighborhood SMOTE considers
the neighboring majority class samples when generating
synthetic minority class samples and reports a better performance
compared to the former variants of SMOTE. The combination
of under- and over-sampling procedures to
balance the training data has also been shown to perform well.
However, a drawback of these approaches is the increased
computational cost that is required for data pre-processing and
for the learning of a classification model.


Algorithm level approaches directly modify the learning
procedure to improve the sensitivity of the classifier towards
minority classes. Zhang et al. first divided the data into
smaller balanced subsets, followed by intelligent sampling
and cost-sensitive SVM learning to deal with the imbalance
problem. A neuro-fuzzy modeling procedure was introduced
to perform leave-one-out cross-validation on
imbalanced datasets. A scaling kernel along with the standard SVM
was used to improve the generalization ability of
learned classifiers for skewed datasets. Li et al. gave
more importance to the minority class samples by setting
weights with Adaboost during the training of an extreme
learning machine (ELM). An ensemble of soft-margin SVMs
was formed via boosting to perform well on both majority
and minority classes. These previous works hint towards
the use of distinct costs for different training examples to
improve the performance of the learning algorithm. However,
they do not address the class imbalance learning of CNNs,
which have recently emerged as the most popular tool for
supervised classification, recognition and segmentation problems
in computer vision. Furthermore, they are
mostly limited to binary class problems, do not
perform joint feature and classifier learning and do not explore
computer vision tasks which inherently have imbalanced class
distributions. In the context of neural networks, Kukar and
Kononenko showed that the incorporation of costs in the
error function improves performance. However, their costs are
randomly chosen in multiple runs of the network and remain
fixed during the learning process in each run. In contrast, this
paper presents the first attempt to incorporate automatic
cost-sensitive learning in deep neural networks for imbalanced data.


After the submission of this work for review, we note that a
number of new approaches have recently been proposed to
incorporate class-specific costs in deep networks.
Chung et al. proposed a new cost-sensitive loss function
which replaces the traditional soft-max with a regression loss. In
contrast, this work extends the traditionally used cost functions
in CNNs to the cost-sensitive setting. Wang et al. and
Raj et al. [32] proposed a loss function which gives equal
importance to mistakes in the minority and majority classes.
In contrast to these works, our method is more flexible because
it automatically learns the balanced error function depending
on the end problem.


III. Proposed Approach


A. Problem Formulation for Cost-Sensitive Classification 

Let the cost $\xi'_{p,q}$ denote the misclassification cost
of classifying an instance belonging to a class $p$ into a different
class $q$. The diagonal of $\xi'$ (i.e., $\xi'_{p,p}, \forall p$) represents the benefit
or utility for a correct prediction. Given an input instance $\mathbf{x}$
and the cost matrix $\xi'$, the classifier seeks to minimize the
expected risk $\mathcal{R}(p|\mathbf{x})$, where $p$ is the class prediction made
by the classifier. The expected risk can be expressed as:

$$\mathcal{R}(p|\mathbf{x}) = \sum_{q} \xi'_{p,q} P(q|\mathbf{x}),$$

where $P(q|\mathbf{x})$ is the posterior probability over all possible
classes given an instance $\mathbf{x}$. According to the Bayes decision
theory, an ideal classifier will give a decision in favour of the
class ($p^*$) with the minimum expected risk:

$$p^* = \arg\min_{p} \mathcal{R}(p|\mathbf{x}) = \arg\min_{p} \mathbb{E}_{X,D}[\xi'], \tag{1}$$

where $X$ and $D$ define the input and output spaces
respectively. Since $P(q|\mathbf{x})$ cannot be found trivially, we make use
of the empirical distribution derived from the training data. Given
a training dataset consisting of tuples of data and
label, $\mathcal{D} = \{\mathbf{x}^{(i)}, \mathbf{d}^{(i)}\}^{M}$ where $\mathbf{d} \in \mathbb{R}^{1 \times N}$, we can define the
empirical risk as follows:

$$\hat{\mathcal{R}}_{\ell}(\mathbf{o}) = \mathbb{E}_{X,D}[\ell] = \frac{1}{M} \sum_{i=1}^{M} \ell(\xi', \mathbf{d}^{(i)}, \mathbf{o}^{(i)}), \tag{2}$$

where $M$ is the total number of images, $\mathbf{o}^{(i)} \in \mathbb{R}^{1 \times N}$ is the
neural network output for the $i$-th sample and $\ell(\cdot)$ is the
misclassification error (0-1 loss) or a surrogate loss function
which is typically used during the classifier training. For the
case of cost-insensitive 0-1 loss, $\ell(\xi', \mathbf{d}^{(i)}, \mathbf{o}^{(i)}) = \mathbb{I}(\mathbf{d}^{(i)} \neq \mathbf{o}^{(i)})$
and $\xi'$ is an $N \times N$ matrix, where $\xi'_{p,p} = 0$ and
$\xi'_{p,q} = 1, \forall p \neq q$. Next, we briefly describe the properties of the
traditionally used cost matrix before introducing the proposed
cost matrix.
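The minimum-risk decision rule above can be illustrated with a small sketch. It is only an illustration under assumptions that are not part of the paper: the function names, the two-class posterior and the cost values are hypothetical, and `cost[p][q]` is read as the cost attached to prediction `p` when the true class is `q`.

```python
def expected_risk(cost, posterior):
    """Risk of each candidate prediction p: sum over q of cost[p][q] * P(q|x)."""
    return [sum(c * prob for c, prob in zip(row, posterior)) for row in cost]


def bayes_decision(cost, posterior):
    """Index of the prediction with the minimum expected risk."""
    risks = expected_risk(cost, posterior)
    return min(range(len(risks)), key=lambda p: risks[p])


# Hypothetical two-class posterior: class 0 is the frequent class.
posterior = [0.7, 0.3]

# Cost-insensitive 0-1 costs versus a cost matrix that penalises
# predicting class 0 when the true class is the rare class 1.
zero_one_cost = [[0.0, 1.0], [1.0, 0.0]]
skewed_cost = [[0.0, 5.0], [1.0, 0.0]]

print(bayes_decision(zero_one_cost, posterior))  # picks the frequent class
print(bayes_decision(skewed_cost, posterior))    # shifts to the rare class
```

With the 0-1 costs the decision simply follows the larger posterior, whereas the asymmetric costs move the decision towards the rare class even though its posterior is smaller.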

Properties of the Cost Matrix $\xi'$: Lemmas III.1 and III.2
describe the main properties of the cost matrix $\xi'$. Their proofs
can be found in Appendix A (supplementary material).

Lemma III.1. Offsetting the columns of the cost matrix $\xi'$ by
any constant ‘c’ does not affect the associated classification
risk $\mathcal{R}$.

For convenience, the utility vector (i.e., the diagonal of the cost
matrix) for correct classification is usually set to zero with the
help of the property from Lemma III.1. We also show next that
even when the utility is not zero, it must satisfy the following
condition:

Lemma III.2. The cost of the true class should be less than
the mean cost of all misclassifications.

Finally, using Lemmas III.1 and III.2, we assume the
following:

Assumption. All costs are non-negative, i.e., $\xi' \succeq 0$.

The entries of a traditional cost matrix (defined according
to the properties above) usually have the form:

$$\xi' = \begin{cases} \xi'_{p,q} = 0 & p = q \\ \xi'_{p,q} \in \mathbb{N} & p \neq q. \end{cases} \tag{3}$$

Such a cost matrix can potentially increase the corresponding
loss to a large value. During CNN training, this network
loss can make the training process unstable and can lead to
the non-convergence of the error function. This requires the
introduction of an alternative cost matrix.


[Figure: Input Layer, Output Layer, Class-dependent Costs ($\xi$), Loss Layer]

Fig. 2: The CNN parameters ($\theta$) and class-dependent costs ($\xi$) used
during the training process of our deep network. Details about the
CNN architecture and the loss layer are in Sec. IV-C and III-C,
respectively.


B. Our Proposed Cost Matrix 

We propose a new cost matrix which is suitable for CNN
training. The cost matrix $\xi$ is used to modify the output of the
last layer of a CNN (before the softmax and the loss layer)
(Fig. 2). The resulting activations are then squashed between
[0,1] before the computation of the classification loss.

For the case of a CNN, the classification decision is made
in favour of the class with the maximum classification score.
During the training process, the classifier weights are modified
in order to reshape the classifier confidences (class
probabilities) such that the desired class has the maximum score and the
other classes have a considerably lower score. However, since
the less frequent classes are under-represented in the training
set, we introduce new ‘score-level costs’ to encourage the
correct classification of infrequent classes. Therefore, the CNN
outputs ($\mathbf{o}$) are modified using the cost matrix ($\xi$) according
to a function ($\mathcal{F}$) as follows:

$$\mathbf{y}^{(i)} = \mathcal{F}(\xi_p, \mathbf{o}^{(i)}), \quad : \; y^{(i)}_p \geq y^{(i)}_j, \; \forall j \neq p,$$

where $\mathbf{y}$ denotes the modified output, $p$ is the desired class
and $\mathcal{F} : \mathbb{R} \to \mathbb{R}$ represents a function whose exact definition
depends on the type of loss layer. As an example, for the case
of cost-sensitive MSE loss, $\mathcal{F}(\xi_p, \mathbf{o}^{(i)}) = \mathrm{sigmoid}(\xi_p \circ \mathbf{o}^{(i)})$,
where $\circ$ denotes the Hadamard product. In Sec. III-C we will
discuss in detail the definition of $\mathcal{F}$ for different surrogate
losses. Note that the score-level costs perturb the classifier
confidences. Such perturbation allows the classifier to give
more importance to the less frequent and difficult-to-separate
classes.

Properties of the Proposed Cost Matrix $\xi$: Next, we
discuss a few properties (Lemmas III.3 - III.6) of the newly
introduced cost matrix $\xi$ and its similarities/differences with
the traditionally used cost matrix $\xi'$ (Sec. III-A). The proofs
of the properties mentioned below can be found in Appendix A
(supplementary material):

Lemma III.3. The cost matrix $\xi$ for a cost-insensitive loss
function is an all-ones matrix, $\mathbf{1}^{p \times p}$, rather than a $\mathbf{1} - \mathbf{I}$ matrix,
as in the case of the traditionally used cost matrix $\xi'$.

Lemma III.4. All costs in $\xi$ are positive, i.e., $\xi \succ 0$.

Lemma III.5. The cost matrix $\xi$ is defined such that all of its
elements are within the range (0,1], i.e., $\xi_{p,q} \in (0,1]$.

Lemma III.6. Offsetting the columns of the cost matrix $\xi$ can
lead to an equally probable guess point.

The cost matrix $\xi$ configured according to the properties
described above (Lemmas III.3 - III.6) neither excessively
increases the CNN output activations, nor does it reduce them
to zero output values. This enables a smooth training process,
allowing the model parameters to be correctly updated. In the
following section, we analyse the implications of the newly
introduced cost matrix $\xi$ on the loss layer (Fig. 2).


C. Cost-Sensitive Surrogate Losses 

Our approach addresses the class imbalance problem during
the training of CNNs. For this purpose, we introduce a
cost-sensitive error function which can be expressed as the mean
loss over the training set:

$$E(\theta, \xi) = \frac{1}{M} \sum_{i=1}^{M} \ell(\mathbf{d}^{(i)}, \mathbf{y}^{(i)}_{\theta,\xi}), \tag{4}$$

where the predicted output ($\mathbf{y}$) of the penultimate layer
(before the loss layer) is parameterized by $\theta$ (network weights
and biases) and $\xi$ (class sensitive costs), $M$ is the total number
of training examples, $\mathbf{d} \in \{0,1\}^{1 \times N}$ is the desired output (s.t.
$\sum_n d_n := 1$) and $N$ denotes the total number of neurons in the
output layer. For conciseness, we will not explicitly mention
the dependence of $\mathbf{y}$ on the parameters ($\theta, \xi$) and only consider
a single data instance in the discussion below. Note that the
error is larger when the model performs poorly on the training
set. The objective of the learning algorithm is to find the
optimal parameters ($\theta^*, \xi^*$) which give the minimum possible
cost $E^*$ (Eq. (4)). Therefore, the optimization objective is
given by:

$$(\theta^*, \xi^*) = \arg\min_{\theta, \xi} E(\theta, \xi). \tag{5}$$

The loss function $\ell(\cdot)$ in Eq. (4) can be any suitable
surrogate loss such as the Mean Square Error (MSE), Support
Vector Machine (SVM) hinge loss or a Cross Entropy (CE)
loss (also called the ‘soft-max log loss’). These popular
loss functions are shown along with other surrogate losses in
Fig. 3. The cost-sensitive versions of these loss functions are
discussed below:

(a) Cost-Sensitive MSE loss: This loss minimizes the squared
error of the predicted output with the desired ground-truth and
can be expressed as follows:

$$\ell(\mathbf{d}, \mathbf{y}) = \frac{1}{2} \sum_{n} (d_n - y_n)^2, \tag{6}$$

where $y_n$ is related to the output of the previous layer $o_n$ via
the logistic function,

$$y_n = \frac{1}{1 + \exp(-o_n \xi_{p,n})}, \tag{7}$$

where $\xi$ is the class sensitive penalty which depends on
the desired class of a particular training sample, i.e., $p = \arg\max_m d_m$.
The effect of this cost on the back-propagation
algorithm is discussed in Sec. III-E1.

[Figure: loss curves; horizontal axis: Neuron Output]

Fig. 3: This figure shows the 0-1 loss along with several other
common surrogate loss functions that are used for binary classification.
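As a rough illustration of the cost-sensitive MSE loss, the sketch below evaluates the logistic output and the squared-error loss for a single sample. The function names and the numeric values are hypothetical and chosen only for illustration; `xi_p` stands for the row of the cost matrix selected by the desired class $p$.

```python
import math


def cs_mse_output(o, xi_p):
    """Logistic output with each score o_n scaled by its cost xi[p][n]."""
    return [1.0 / (1.0 + math.exp(-o_n * c_n)) for o_n, c_n in zip(o, xi_p)]


def cs_mse_loss(d, o, xi_p):
    """Half the summed squared error between the target d and the output y."""
    y = cs_mse_output(o, xi_p)
    return 0.5 * sum((d_n - y_n) ** 2 for d_n, y_n in zip(d, y))


d = [1.0, 0.0]          # one-hot target, desired class p = 0
o = [2.0, -1.0]         # hypothetical raw network outputs
uniform = [1.0, 1.0]    # cost-insensitive row of the cost matrix
damped = [0.5, 1.0]     # smaller cost on the true-class score

print(cs_mse_loss(d, o, uniform))
print(cs_mse_loss(d, o, damped))
```

In this toy case a smaller cost on the true-class score lowers the corresponding activation, so the same raw outputs incur a larger loss and the network is pushed to produce a larger score for that class.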


(b) Cost-Sensitive SVM hinge loss: This loss maximizes the
margin between each pair of classes and can be expressed as
follows:

$$\ell(\mathbf{d}, \mathbf{y}) = \sum_{n} \max(0, 1 - (2d_n - 1) y_n), \tag{8}$$

where $y_n$ can be represented in terms of the previous layer
output $o_n$ and the cost $\xi$ as follows:

$$y_n = o_n \xi_{p,n}. \tag{9}$$

The effect of the introduced cost on the gradient computation
is discussed in Sec. III-E2.

(c) Cost-Sensitive CE loss: This loss maximizes the closeness
of the prediction to the desired output and is given by:

$$\ell(\mathbf{d}, \mathbf{y}) = -\sum_{n} d_n \log y_n, \tag{10}$$

where $y_n$ incorporates the class-dependent cost ($\xi$) and is
related to the output $o_n$ via the soft-max function,

$$y_n = \frac{\xi_{p,n} \exp(o_n)}{\sum_{k} \xi_{p,k} \exp(o_k)}. \tag{11}$$

The effect of the modified CE loss on the back-propagation
algorithm is discussed in Sec. III-E3.

Classification Feasibility of Cost-Sensitive Losses: Next,
we show (Lemmas III.7 - III.9) that the cost-sensitive loss
functions remain suitable for classification since they satisfy
the following properties:

1) Classification Calibration

2) Guess Aversion

Note that a classification calibrated (c-calibrated) loss is
useful because the minimization of the empirical risk leads to
classifiers which have risks that are closer to the Bayes-risk.
Similarly, guess aversion implies that the loss function favours
‘correct classification’ instead of ‘arbitrary guesses’. Since the CE
loss usually performs best among the three loss functions
discussed above, Lemmas III.7 - III.9 show that the
cost-sensitive CE loss is guess aversive and classification calibrated.

Lemma III.7. For a real valued $\xi$ with entries in $(0, 1]$, given
$\mathbf{d}^{(i)}$ and the CNN output $\mathbf{o}^{(i)}$, the modified cost-sensitive CE
loss will be guess-averse iff,

$$L(\xi, \mathbf{d}, \mathbf{o}) < L(\xi, \mathbf{d}, \mathbf{g}),$$

where $\mathbf{g}$ is the set of all guess points.

Proof: For real valued CNN activations, the guess point
maps to an all zero output:

$$L(\xi, \mathbf{d}, \mathbf{o}) < L(\xi, \mathbf{d}, \mathbf{0}),$$

$$-\log\left(\frac{\xi_{p,n} \exp(o_n)}{\sum_{k} \xi_{p,k} \exp(o_k)}\right) < -\log\left(\frac{\xi_{p,n}}{\sum_{k} \xi_{p,k}}\right),$$

which can be satisfied if,

$$\frac{\xi_{p,n} \exp(o_n)}{\sum_{k} \xi_{p,k} \exp(o_k)} > \frac{\xi_{p,n}}{\sum_{k} \xi_{p,k}},$$

where $n$ is the true class. Since $\xi_{p,n} \in (0,1]$, it
is $> 0$. Also, if $n$ is the true class then $o_n > o_k, \forall k \neq n$.
Therefore, the above relation holds true. ■

Lemma III.8. The cost matrix has diagonal entries greater
than zero, i.e., $\mathrm{diag}(\xi) > 0$.

Proof: According to Lemma III.7, if the CE loss is guess
aversive, it must satisfy,

$$L(\xi, \mathbf{d}, \mathbf{o}) < L(\xi, \mathbf{d}, \mathbf{0}).$$

We prove the Lemma by contradiction. Let us suppose that
$\xi_{p,p} = 0$; then, for the true class $n = p$, the above relation does not hold true, since:

$$\frac{\xi_{p,n} \exp(o_n)}{\sum_{k} \xi_{p,k} \exp(o_k)} = \frac{\xi_{p,n}}{\sum_{k} \xi_{p,k}} = 0,$$

and hence, $\mathrm{diag}(\xi) > 0$. ■
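The guess-aversion condition of Lemma III.7 can be checked numerically for a single sample. The following is a minimal sketch with hypothetical names and values; it assumes the true class has the largest raw score and that all costs lie in $(0, 1]$, as the lemma requires.

```python
import math


def cs_softmax(o, xi_p):
    """Cost-weighted soft-max: xi[p][n] * exp(o_n) / sum_k xi[p][k] * exp(o_k)."""
    shift = max(o)  # subtracting the maximum leaves the ratio unchanged
    weighted = [c_k * math.exp(o_k - shift) for o_k, c_k in zip(o, xi_p)]
    total = sum(weighted)
    return [w / total for w in weighted]


def cs_ce_loss(d, o, xi_p):
    """Cross-entropy between the one-hot target d and the cost-weighted soft-max."""
    y = cs_softmax(o, xi_p)
    return -sum(d_n * math.log(y_n) for d_n, y_n in zip(d, y) if d_n > 0)


d = [1.0, 0.0, 0.0]        # true class is 0
o = [2.0, 0.5, -1.0]       # hypothetical scores, largest for the true class
xi_p = [0.4, 1.0, 0.7]     # hypothetical costs in (0, 1]
guess = [0.0, 0.0, 0.0]    # the all-zero guess point

print(cs_ce_loss(d, o, xi_p))
print(cs_ce_loss(d, guess, xi_p))
```

For these values the loss at the informative output is smaller than the loss at the all-zero guess point, which is the behaviour the lemma describes.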

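Lemma III.9. The cost-sensitive CE loss function

$$\ell(\xi, \mathbf{d}, \mathbf{o}) = -\sum_{n} d_n \log\left(\frac{\xi_{p,n} \exp(o_n)}{\sum_{k} \xi_{p,k} \exp(o_k)}\right)$$

is C-Calibrated.

Proof: Given an input sample $\mathbf{x}$ which belongs to class
$p^*$ (i.e., $d_{p^*} = 1$), the CE loss can be expressed as:

$$\ell(\xi, \mathbf{d}, \mathbf{o}) = -\log\left(\frac{\xi_{p^*,p^*} \exp(o_{p^*})}{\sum_{k} \xi_{p^*,k} \exp(o_k)}\right).$$

The classification risk can be expressed in terms of the
expected value as follows:

$$\mathcal{R}_{\ell}[\mathbf{o}] = \mathbb{E}_{X,D}[\ell(\xi, \mathbf{d}, \mathbf{o})] = \sum_{p=1}^{N} P(p|\mathbf{x})\, \ell(\xi, p, \mathbf{o}) = -\sum_{p=1}^{N} P(p|\mathbf{x}) \log\left(\frac{\xi_{p,p} \exp(o_p)}{\sum_{k} \xi_{p,k} \exp(o_k)}\right).$$

Next, we compute the derivative and set it to zero to find the
ideal set of CNN outputs ‘$\mathbf{o}$’:

$$\frac{\partial \mathcal{R}_{\ell}[\mathbf{o}]}{\partial o_t} = \left.\frac{\partial \mathcal{R}_{\ell}[o_p]}{\partial o_t}\right|_{p \neq t} + \left.\frac{\partial \mathcal{R}_{\ell}[o_p]}{\partial o_t}\right|_{p = t} = 0.$$

For the term with $p = t$:

$$\left.\frac{\partial \mathcal{R}_{\ell}[o_p]}{\partial o_t}\right|_{p = t} = \frac{\partial}{\partial o_t}\left(-P(t|\mathbf{x}) \log(\xi_{t,t} \exp(o_t)) + P(t|\mathbf{x}) \log\Big(\sum_{k=1}^{N} \xi_{t,k} \exp(o_k)\Big)\right) = -P(t|\mathbf{x}) + P(t|\mathbf{x}) \frac{\xi_{t,t} \exp(o_t)}{\sum_{k=1}^{N} \xi_{t,k} \exp(o_k)}.$$

Similarly,

$$\left.\frac{\partial \mathcal{R}_{\ell}[o_p]}{\partial o_t}\right|_{p \neq t} = \sum_{p \neq t} P(p|\mathbf{x}) \frac{\xi_{p,t} \exp(o_t)}{\sum_{k=1}^{N} \xi_{p,k} \exp(o_k)}.$$

By adding the above two derived expressions and setting
their sum to zero, we have:

$$P(t|\mathbf{x}) = \exp(o_t) \sum_{p=1}^{N} \frac{P(p|\mathbf{x})\, \xi_{p,t}}{\sum_{k=1}^{N} \xi_{p,k} \exp(o_k)},$$

$$o_t = \log(P(t|\mathbf{x})) - \log\left(\sum_{p=1}^{N} \frac{P(p|\mathbf{x})\, \xi_{p,t}}{\sum_{k=1}^{N} \xi_{p,k} \exp(o_k)}\right),$$

which shows that there exists an inverse relationship between
the optimal CNN output and the Bayes cost of the $t$-th
class, and hence, the cost-sensitive CE loss is classification
calibrated. ■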


Under the properties of Lemmas III.7 - III.9, the modified
loss functions are therefore suitable for classification. Having
established the class-dependent costs (Sec. III-B) and their
impact on the loss layer (Sec. III-C), we next describe the
training algorithm to automatically learn all the parameters of
our model ($\theta$ and $\xi$).


D. Optimal Parameters Learning 

When using any of the previously mentioned loss functions
(Eqs. (6)-(11)), our goal is to jointly learn the hypothesis
parameters $\theta$ and the class-dependent loss function parameters $\xi$. For
the joint optimization, we alternately solve for both types of
parameters by keeping one fixed and minimizing the cost with
respect to the other (Algorithm 1). Specifically, for the
optimization of $\theta$, we use stochastic gradient descent (SGD)
with the back-propagation of error (Eq. (4)). Next, to optimize
$\xi$, we again use the gradient descent algorithm to calculate
the direction of the step to update the parameters. The cost
function is also dependent on the class-to-class separability, the
current classification errors made by the network with the current
estimate of parameters and the overall classification error. The
class-to-class (c2c) separability is measured by estimating the
spread of the within-class samples (intraclass) compared to the
between-class (interclass) ones. In other words, it measures the
relationship between the within-class sample distances and the
size of the separating boundary between the different classes.
Note that the proposed cost function can be easily extended
to include an externally defined cost matrix for applications
where expert opinion is necessary. However, this paper mainly
deals with class-imbalance in image classification datasets
where externally specified costs are not required.

Algorithm 1 Iterative optimization for parameters (θ, ξ)

Input: Training set (x, d), Validation set (x_V, d_V), Max.
epochs (M_ep), Learning rate for θ (γ_θ), Learning rate for
ξ (γ_ξ)

Output: Learned parameters (θ*, ξ*)

    Net ← construct_CNN()
    θ ← initialize_Net(Net)                              ▷ Random initialization
    ξ ← 1, val-err ← 1
    for e ∈ [1, M_ep] do                                 ▷ Number of epochs
        grad_ξ ← compute-grad(x, d, F(ξ))                ▷ Eq. (16)
        ξ* ← update-CostParams(ξ, γ_ξ, grad_ξ)
        ξ ← ξ*
        for b ∈ [1, B] do                                ▷ Number of batches
            out_b ← forward-pass(x_b, d_b, Net, θ)
            grad_b ← backward-pass(out_b, x_b, d_b, Net, θ, ξ)
            θ* ← update-NetParams(Net, θ, γ_θ, grad_b)
            θ ← θ*
        end for
        val-err* ← forward-pass(x_V, d_V, Net, θ)
        if val-err* > val-err then
            γ_ξ ← γ_ξ * 0.01                             ▷ Decrease step size
            val-err ← val-err*
        end if
    end for
    return (θ*, ξ*)
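The alternating scheme can be outlined as a short loop. This is only a structural sketch under stated assumptions, not the authors' implementation: `cost_grad`, `net_grad` and `val_error` are hypothetical callables supplied by the caller, parameters are plain lists of floats, and the toy usage replaces the CNN with simple quadratic objectives.

```python
def alternate_optimize(theta, xi, batches, max_epochs, lr_theta, lr_xi,
                       cost_grad, net_grad, val_error):
    """Alternate between a cost step (theta fixed) and network steps (xi fixed)."""
    best_val = 1.0  # the validation error is initialised to 1
    for _ in range(max_epochs):
        # Cost step: one gradient update of xi with theta held fixed.
        g_xi = cost_grad(theta, xi)
        xi = [c - lr_xi * g for c, g in zip(xi, g_xi)]

        # Network steps: one pass over the batches with xi held fixed.
        for batch in batches:
            g_theta = net_grad(theta, xi, batch)
            theta = [t - lr_theta * g for t, g in zip(theta, g_theta)]

        # Shrink the cost step size when the validation error goes up.
        err = val_error(theta, xi)
        if err > best_val:
            lr_xi *= 0.01
            best_val = err
    return theta, xi


# Toy usage with quadratic stand-ins for the real gradients (all hypothetical).
theta, xi = alternate_optimize(
    theta=[0.0], xi=[1.0], batches=[1.0], max_epochs=30,
    lr_theta=0.25, lr_xi=0.25,
    cost_grad=lambda th, c: [2.0 * (c[0] - 0.2)],
    net_grad=lambda th, c, b: [2.0 * (th[0] - b)],
    val_error=lambda th, c: abs(th[0] - 1.0),
)
print(theta, xi)
```

Passing the gradient and validation routines as arguments keeps the loop independent of any particular network library; in a real setting they would wrap the forward and backward passes of the CNN.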

To calculate the c2c separability, we first compute a suitable
distance measure between each point in a class $c_p$ and its
nearest neighbour belonging to $c_p$ and the nearest neighbour in
class $c_q$. Note that these distances are calculated in the feature
space where each point is a 4096 dimensional feature vector
($f_i : i \in [1, N']$, $N'$ being the number of samples belonging to class
$c_p$) obtained from the penultimate CNN layer (just before the
output layer). Next, for each point in a class, we take the ratio
of its intraclass distance to its interclass distance and average
these ratios over the class to find the c2c separability index.
Formally, the class separability between two classes, $p$ and $q$,
is defined as:

$$S(p, q) = \frac{1}{N'} \sum_{i} \frac{dist_{intraNN}(f_i)}{dist_{interNN}(f_i)}.$$
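The separability index can be computed directly from its definition. The sketch below is illustrative only: it assumes Euclidean distance as the "suitable distance measure", uses hypothetical two-dimensional points in place of 4096-dimensional CNN features, and requires at least two samples in class $p$.

```python
import math


def nearest_distance(point, candidates):
    """Euclidean distance from point to its nearest neighbour in candidates."""
    return min(math.dist(point, other) for other in candidates)


def c2c_separability(feats_p, feats_q):
    """Mean ratio of intraclass to interclass nearest-neighbour distance."""
    ratios = []
    for i, f in enumerate(feats_p):
        same_class = feats_p[:i] + feats_p[i + 1:]   # exclude the point itself
        ratios.append(nearest_distance(f, same_class) / nearest_distance(f, feats_q))
    return sum(ratios) / len(ratios)


class_p = [(0.0, 0.0), (0.0, 1.0)]
far_q = [(4.0, 0.0), (4.0, 1.0)]     # well separated from class p
near_q = [(1.0, 0.0), (1.0, 1.0)]    # as close as the intraclass neighbours

print(c2c_separability(class_p, far_q))
print(c2c_separability(class_p, near_q))
```

A small value indicates that samples of class $p$ are much closer to each other than to class $q$ (easy to separate), while values near or above one indicate overlapping classes.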

To avoid over-fitting and to keep this step computationally
feasible, we measure the c2c separability on a small validation
set. Also, the c2c separability was found to correlate well with
the confusion matrix at each epoch. Therefore, the measure was
calculated after every 10 epochs to minimize the computational
overhead. Note that simply setting the parameters ($\xi$) based
on the percentages of the classes in the data distribution results
in poor performance (Sec. IV). This suggests that the
optimal parameter values for class-dependent costs ($\xi^*$) should
not be the same as the frequency of the classes in the training
data distribution. The following cost function is used for the
gradient computation to update $\xi$:

$$F(\xi) = \| T - \xi \|_2^2 + E_{val}(\theta, \xi), \tag{12}$$

where $E_{val}$ is the validation error. The matrix $T$ is defined as
follows:

r = ), (13) 

where $\mu$, $\sigma$ denote the parameters which are set using cross
validation, $R$ denotes the current classification errors as a
confusion matrix, $S$ denotes the class c2c separability matrix
and $H$ is a matrix defined using the histogram vector $\mathbf{h}$ which
encodes the distribution of classes in the training set. The
matrix $H$ and vector $\mathbf{h}$ are linked as follows:

$$H(p, q) = \begin{cases} \max(h_p, h_q) & : p \neq q, \; (p, q) \in c, \\ h_p & : p = q, \; p \in c, \end{cases} \tag{14}$$

where $c$ is the set of all classes in a given dataset. The
resulting minimization objective to find the optimal $\xi^*$ can
be expressed as:

$$\xi^* = \arg\min_{\xi} F(\xi). \tag{15}$$
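The link between the histogram vector and the matrix $H$ is easy to make concrete. The helper below is a small sketch with a hypothetical function name and a hypothetical three-class histogram.

```python
def histogram_matrix(h):
    """H(p, q) = max(h_p, h_q) off the diagonal and h_p on the diagonal."""
    n = len(h)
    return [[h[p] if p == q else max(h[p], h[q]) for q in range(n)]
            for p in range(n)]


# Hypothetical class distribution: one frequent class and two rarer ones.
h = [0.6, 0.3, 0.1]
for row in histogram_matrix(h):
    print(row)
```

Every off-diagonal entry that involves the frequent class inherits its larger frequency, while the diagonal keeps each class's own frequency.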

In order to optimize the cost function in Eq. (15), we use the
gradient descent algorithm which computes the direction of
the update step, as follows:

$$\nabla F(\xi) = \nabla (v_a - v_b)(v_a - v_b)^T = (v_a - v_b) J_{v_b} = -(v_a - v_b) \mathbf{1}^T, \tag{16}$$

where $v_a = vec(T)$, $v_b = vec(\xi)$ and $J$ denotes the Jacobian
matrix. Note that in order to incorporate the dependence of
$F(\xi)$ on the validation error $E_{val}$, we take the update step
only if it results in a decrease in $E_{val}$ (see Algorithm 1).
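A single cost update with the validation check can be sketched as follows. The names are hypothetical, `val_error` is a stand-in for evaluating the network on the validation set, and the exact gradient of the squared-distance term, $-2(T - \xi)$, is used; the constant factor can be folded into the learning rate.

```python
def propose_cost_step(xi, target, lr):
    """One gradient-descent step on ||T - xi||^2, whose gradient is -2 (T - xi)."""
    return [[c - lr * (-2.0 * (t - c)) for c, t in zip(xi_row, t_row)]
            for xi_row, t_row in zip(xi, target)]


def update_costs(xi, target, lr, val_error):
    """Keep the proposed costs only if they lower the validation error."""
    proposal = propose_cost_step(xi, target, lr)
    return proposal if val_error(proposal) < val_error(xi) else xi


# Hypothetical 2x2 example: start from the all-ones matrix and move towards T.
xi = [[1.0, 1.0], [1.0, 1.0]]
target = [[1.0, 0.5], [0.25, 1.0]]


def toy_val_error(costs):
    # Stand-in for the validation error: distance of the costs from the target.
    return sum(abs(c - t) for c_row, t_row in zip(costs, target)
               for c, t in zip(c_row, t_row))


print(update_costs(xi, target, 0.25, toy_val_error))
```

When the proposal does not reduce the validation error, the previous costs are returned unchanged, which mirrors the acceptance rule described in the text.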

Since, our approach involves the use of modified loss 
functions during the CNN parameter learning process, we will 
discuss their effect on the back-propagation algorithm in the 
next section. 


E. Effect on Error Back-propagation 

In this section, we discuss the impact of the modified loss 
functions on the gradient computation of the back-propagation 
algorithm. 


1) Cost-Sensitive MSE: During the supervised training, the
MSE loss minimizes the mean squared error between the
predicted weighted outputs of the model $\mathbf{y}$ and the
ground-truth labels $\mathbf{d}$, across the entire training set (Eq. (6)). The
modification of the loss function changes the gradient
computed during the back-propagation algorithm. Therefore, for
the output layer, the mathematical expression of the gradient
at each neuron is given by:

$$\frac{\partial \ell(\mathbf{d}, \mathbf{y})}{\partial o_n} = -(d_n - y_n) \frac{\partial y_n}{\partial o_n}.$$

The $y_n$ for the cost-sensitive MSE loss can be defined as:

$$y_n = \frac{1}{1 + \exp(-o_n \xi_{p,n})}.$$

The partial derivative can be calculated as follows:

$$\frac{\partial y_n}{\partial o_n} = \frac{\xi_{p,n} \exp(-o_n \xi_{p,n})}{(1 + \exp(-o_n \xi_{p,n}))^2} = \frac{\xi_{p,n}}{(1 + \exp(o_n \xi_{p,n}))(1 + \exp(-o_n \xi_{p,n}))},$$

$$\frac{\partial y_n}{\partial o_n} = \xi_{p,n} y_n (1 - y_n).$$

The derivative of the loss function is therefore given by:

$$\frac{\partial \ell(\mathbf{d}, \mathbf{y})}{\partial o_n} = -\xi_{p,n} (d_n - y_n) y_n (1 - y_n). \tag{17}$$
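The closed-form gradient of the cost-sensitive MSE loss can be compared against central finite differences as a sanity check. This is a minimal sketch with hypothetical names and values, not part of the original derivation.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def cs_mse_loss(d, o, xi_p):
    """Half the summed squared error with logistic outputs sigmoid(xi * o)."""
    return 0.5 * sum((d_n - sigmoid(c_n * o_n)) ** 2
                     for d_n, o_n, c_n in zip(d, o, xi_p))


def cs_mse_grad(d, o, xi_p):
    """Closed form: -xi[p][n] * (d_n - y_n) * y_n * (1 - y_n) for each neuron n."""
    grads = []
    for d_n, o_n, c_n in zip(d, o, xi_p):
        y_n = sigmoid(c_n * o_n)
        grads.append(-c_n * (d_n - y_n) * y_n * (1.0 - y_n))
    return grads


def numeric_grad(d, o, xi_p, eps=1e-6):
    """Central finite differences of the loss with respect to each o_n."""
    grads = []
    for n in range(len(o)):
        up = o[:n] + [o[n] + eps] + o[n + 1:]
        down = o[:n] + [o[n] - eps] + o[n + 1:]
        grads.append((cs_mse_loss(d, up, xi_p) - cs_mse_loss(d, down, xi_p)) / (2 * eps))
    return grads


d, o, xi_p = [1.0, 0.0, 0.0], [0.8, -0.3, 1.5], [0.4, 1.0, 0.7]
print(cs_mse_grad(d, o, xi_p))
print(numeric_grad(d, o, xi_p))
```

The two printed vectors should agree up to the finite-difference error, which confirms that the cost enters the gradient as a multiplicative factor on each neuron.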


2) Cost-Sensitive SVM Hinge Loss: For the SVM hinge
loss function given in Eq. (8), the directional derivative can
be computed at each neuron as follows:

$$\frac{\partial \ell(d,y)}{\partial o_n} = -(2d_n - 1)\frac{\partial y_n}{\partial o_n}\,\mathbb{I}\{1 > y_n(2d_n - 1)\}.$$

The partial derivative of the output of the softmax w.r.t. the
output of the penultimate layer is given by: $\partial y_n / \partial o_n = \xi_{p,n}$.
By combining the above two expressions, the derivative of the
loss function can be represented as:

$$\frac{\partial \ell(d,y)}{\partial o_n} = -(2d_n - 1)\,\xi_{p,n}\,\mathbb{I}\{1 > y_n(2d_n - 1)\}, \quad (18)$$

where $\mathbb{I}(\cdot)$ denotes an indicator function.
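Eq. (18) can be illustrated with a short sketch. It assumes the linear output $y_n = o_n \xi_{p,n}$ (consistent with $\partial y_n / \partial o_n = \xi_{p,n}$) and the standard hinge form $\sum_n \max(0, 1 - (2d_n - 1) y_n)$; both are assumptions, since the loss definition lies outside this section, and the function names are illustrative.

```python
import numpy as np

def cosen_hinge_loss(o, d, xi_p):
    # Assumed form: sum_n max(0, 1 - (2 d_n - 1) y_n) with y_n = o_n * xi_{p,n}
    y = o * xi_p
    return np.sum(np.maximum(0.0, 1.0 - (2.0 * d - 1.0) * y))

def cosen_hinge_grad(o, d, xi_p):
    # Eq. (18): -(2 d_n - 1) xi_{p,n} I{1 > y_n (2 d_n - 1)}
    y = o * xi_p
    t = 2.0 * d - 1.0
    active = (1.0 > y * t).astype(float)   # indicator of a margin violation
    return -t * xi_p * active
```

Neurons that already satisfy the margin receive a zero gradient, while the cost $\xi_{p,n}$ scales the gradient of the remaining ones.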

3) Cost-Sensitive CE loss: The cost-sensitive softmax log
loss function is defined in Eq. (10). Next, we show that the
introduction of a cost in the CE loss does not change the
gradient formulas and the cost is rather incorporated implicitly
in the softmax output $y_m$. The effect of costs on the CE loss
surface is illustrated in Fig. 4.

Proposition 1. The introduction of a class imbalance cost
$\xi(\cdot)$ in the softmax loss ($\ell(\cdot)$ in Eq. 10), does not affect
the computation of the gradient during the back-propagation
process.

Proof: We start with the calculation of the partial derivative
of the softmax neuron with respect to its input:

$$\frac{\partial y_n}{\partial o_m} = \frac{\partial}{\partial o_m}\left(\frac{\xi_{p,n}\exp(o_n)}{\sum_k \xi_{p,k}\exp(o_k)}\right). \quad (19)$$
Fig. 4: The CE loss function for the case of binary classification. Left: loss surface for a single class for different costs (high cost for first
class (C1), no cost, high cost for second class (C2)). Right: the minimum loss values for all possible values of class scores illustrate the
obvious classification boundaries. The score-level costs reshape the loss surface and the classification boundaries are effectively shifted in
favour of the classes with relatively lower cost.


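Now, two cases can arise here, either $m = n$ or $m \neq n$. We
first solve for the case when $n = m$:

$$\text{LHS} = \frac{\xi_{p,m}\exp(o_m)\sum_k \xi_{p,k}\exp(o_k) - \xi_{p,m}^2\exp(2o_m)}{\left(\sum_k \xi_{p,k}\exp(o_k)\right)^2}.$$

After simplification we get:

$$\frac{\partial y_n}{\partial o_m} = y_m(1 - y_m), \quad \text{s.t.: } m = n.$$

Next, we solve for the case when $n \neq m$:

$$\text{LHS} = -\frac{\xi_{p,m}\,\xi_{p,n}\exp(o_m + o_n)}{\left(\sum_k \xi_{p,k}\exp(o_k)\right)^2} = -y_m y_n, \quad \text{s.t.: } m \neq n.$$

The loss function can be differentiated as follows:

$$\frac{\partial \ell(y,d)}{\partial o_m} = -\sum_n d_n \frac{1}{y_n}\frac{\partial y_n}{\partial o_m} = -d_m(1 - y_m) + \sum_{n \neq m} d_n y_m = -d_m + \sum_n d_n y_m.$$

Since $d$ is defined as a probability distribution over all output
classes ($\sum_n d_n = 1$), therefore:

$$\frac{\partial \ell(y,d)}{\partial o_m} = -d_m + y_m.$$

This result is the same as in the case when CE does not
contain any cost-sensitive parameters. Therefore the costs
affect the softmax output but the gradient formulas remain
unchanged. ■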
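Proposition 1 can be checked numerically. The sketch below assumes the CE loss $\ell = -\sum_n d_n \log y_n$ with the cost-weighted softmax used in the proof; function names are illustrative, and the max-shift inside the softmax is an added numerical-stability detail that cancels in the ratio.

```python
import numpy as np

def cosen_softmax(o, xi_p):
    # y_n = xi_{p,n} exp(o_n) / sum_k xi_{p,k} exp(o_k); the shift by max(o) cancels out
    z = xi_p * np.exp(o - o.max())
    return z / z.sum()

def cosen_ce_loss(o, d, xi_p):
    # Assumed CE form: -sum_n d_n log y_n
    return -np.sum(d * np.log(cosen_softmax(o, xi_p)))

def cosen_ce_grad(o, d, xi_p):
    # Same formula as the cost-free case; the costs enter only through y
    return cosen_softmax(o, xi_p) - d
```

Setting all entries of `xi_p` to 1 recovers the ordinary softmax, so the same gradient routine serves both the baseline and the cost-sensitive layer.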

In our experiments (Sec. IV), we will only report performances
with the cost-sensitive CE loss function. This is
because it has been shown that the CE loss outperforms the
other two loss functions in most cases [?]. Moreover, it avoids
the learning slowing down problem of the MSE loss [?].


IV. Experiments and Results 


The class imbalance problem is present in nearly all real- 
world object and image datasets. This is not because of any 
flawed data collection, but it is simply due to the natural 
frequency patterns of different object classes in real life. For
example, a bed appears in nearly every bedroom scene, but a 
baby cot appears much less frequently. Consequently, from the 
perspective of class imbalance, the currently available image 
classification datasets can be divided into three categories:


1) Datasets with a significant class imbalance both in the 
training and the testing split (e.g., DIL, MLC),

2) Datasets with unbalanced class distributions but with 
experimental protocols that are designed in a way that 
an equal number of images from all classes are used 
during the training process (e.g., MIT-67, Caltech-101). 
The testing images can be equal or unequal for different 
classes. 

3) Datasets with an equal representation of each class in the 
training and testing splits (e.g., MNIST, CIFAR-100).

We perform extensive experiments on six challenging image
classification datasets (two from each category) (see
Sec. IV-B). For the case of imbalanced datasets (1st category),
we report results on the standard splits for two experiments.
For the two datasets from the 2nd category, we report our
performances on the standard splits, deliberately deformed
splits and the original data distributions. For the two datasets
from the 3rd category, we report results on the standard
splits and on deliberately imbalanced splits. Since our training
procedure requires a small validation set (Algorithm 1), we use
~5% of the training data in each experiment as a held-out
validation set.


A. Multi-class Performance Metric 

The main goal of this work is to enhance the overall
classification accuracy without compromising the precision of
minority and majority classes. Therefore, we report overall
classification accuracy results in Tables I-VI, VIII and IX
for comparisons with baseline and state-of-the-art balanced
and unbalanced data classification approaches. We report class
recall rates in confusion matrices displayed in Fig. 6. We also
show our results in terms of G-mean and F-measure scores on
all the six datasets (see Table VII). Note that the F-measure
and G-mean scores are primarily used for binary classification
tasks. Here, we extend them to multi-class problems using the
approach in [?], where these scores are calculated for each
class in a one-vs-all setting and their weighted average is
calculated using the class frequencies.
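The one-vs-all extension can be sketched from a confusion matrix as below. The per-class definitions used here (F-measure as the harmonic mean of precision and recall, G-mean as the geometric mean of sensitivity and specificity) are the standard binary ones and are an assumption; the cited approach may differ in detail, and the function name is illustrative.

```python
import numpy as np

def weighted_f_and_gmean(conf):
    """conf[i, j] = number of class-i samples predicted as class j."""
    conf = np.asarray(conf, dtype=float)
    total = conf.sum()
    support = conf.sum(axis=1)             # true samples per class
    tp = np.diag(conf)
    fn = support - tp
    fp = conf.sum(axis=0) - tp
    tn = total - tp - fn - fp
    with np.errstate(divide="ignore", invalid="ignore"):
        precision = np.where(tp + fp > 0, tp / (tp + fp), 0.0)
        recall = np.where(support > 0, tp / support, 0.0)
        specificity = np.where(tn + fp > 0, tn / (tn + fp), 0.0)
        f = np.where(precision + recall > 0,
                     2 * precision * recall / (precision + recall), 0.0)
    g = np.sqrt(recall * specificity)
    w = support / total                    # class-frequency weights
    return float(np.sum(w * f)), float(np.sum(w * g))
```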

It is also important to note that neural networks give a 
single classification score and it is therefore not feasible to 
obtain ROC curves. As a result, we have not included AUC 
measurements in our experimental results. 

B. Datasets and Experimental Settings 

1) Imbalanced Datasets: Melanoma Detection: Edinburgh
Dermofit Image Library (DIL) consists of 1300
high quality skin lesion images based on diagnosis from
dermatologists and dermatopathologists. There are 10 types
of lesions identified in this dataset including melanomas,
seborrhoeic keratosis and basal cell carcinomas. The number
of images in each category varies between 24 and 331 (mean
130, median 83). Similar to [17], we report results with 3-fold
cross validation.

Coral Classification: Moorea Labelled Corals (MLC)
contains 2055 images from three coral reef habitats during
2008-10. Each image is annotated with roughly 200 points
belonging to the 9 classes (4 non-corals, 5 corals). Therefore
in total, there are nearly 400,000 labelled points. The class
representation varies approximately from 2622 to 196910
(mean 44387, median 30817). We perform two of the major
standard experiments on this dataset similar to [?]. The first
experiment involves training and testing on data from year
2008. In the second experiment, training is carried out on data
from year 2008 and testing on data from year 2009.

2) Imbalanced Datasets-Balanced Protocols: Object Classification:
Caltech-101 contains a total of 9,144 images,
divided into 102 categories (101 objects + background). The 
number of images for each category varies between 31 and 
800 images (mean: 90, median 59). The dataset is originally 
imbalanced but the standard protocol which is balanced uses 
30 or 15 images for each category during training, and testing 
is performed on the remaining images (max. 50). We perform 
experiments using the standard 60%/40% and 30%/70% 
train/test splits. 

Scene Classification: MIT-67 consists of 15,620 images 
belonging to 67 classes. The number of images varies between 
101 and 738 (mean: 233, median: 157). The standard protocol 
uses a subset of 6700 images (100 per class) for training and 
evaluation to make the distribution uniform. We will, however, 
evaluate our approach both on the standard split (80 images 
for training, 20 for testing) and the complete dataset with 
imbalanced train/test splits of 60%/40% and 30%/70%. 

3) Balanced Datasets-Balanced Protocols: Handwritten
Digit Classification: MNIST consists of 70,000 images of
digits (0-9). Out of the total, 60,000 images are used for
training (~600/class) and the remaining 10,000 for testing
(~100/class). We evaluate our approach on the standard split


Methods (using stand. split)    Exp. 1 (5-classes)    Exp. 2 (10-classes)
Hierarchical-KNN [37]           74.3 ± 2.5%           68.8 ± 2.0%
Hierarchical-Bayes [38]         69.6 ± 0.4%           63.1 ± 0.6%
Flat-KNN [17]                   69.8 ± 1.6%           64.0 ± 1.3%
Baseline CNN                    75.2 ± 2.7%           69.5 ± 2.3%
CoSen CNN                       80.2 ± 2.5%           72.6 ± 1.6%

TABLE I: Evaluation on DIL Database.


Methods (using stand. split)    Exp. 1 (2008)    Exp. 2 (2008-2009)
MTM-CCS (LAB) [?]               74.3%            67.3%
MTM-CCS (RGB) [?]               72.5%            66.0%
Baseline CNN                    72.9%            66.1%
CoSen CNN                       75.2%            68.6%

TABLE II: Evaluation on MLC Database.


as well as the deliberately imbalanced splits. To imbalance the 
training distribution, we reduce the representation of even and 
odd digit classes to only 25% and 10% of images, respectively. 

Image Classification: CIFAR-100 contains 60,000 images
belonging to 100 classes (600 images/class). The standard
train/test split for each class is 500/100 images. We evaluate
our approach on the standard split as well as on artificially
imbalanced splits. To imbalance the training distribution, we
reduce the representation of even-numbered and odd-numbered
classes to only 25% and 10% of images, respectively.
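A deliberately imbalanced training split of this kind can be generated with a few lines of code. The sketch below is illustrative only: the function name, the use of random subsampling and the `seed` parameter are assumptions, as the text does not state how the retained images were chosen.

```python
import numpy as np

def imbalance_by_parity(labels, parity, keep_fraction, seed=0):
    """Return sorted indices of a training subset in which classes whose label
    has the given parity (0 = even, 1 = odd) keep only `keep_fraction` of
    their samples; all other classes are kept in full."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    keep = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        if c % 2 == parity:
            n_keep = max(1, int(round(keep_fraction * idx.size)))
            idx = rng.choice(idx, size=n_keep, replace=False)   # random subsample
        keep.append(idx)
    return np.sort(np.concatenate(keep))
```

For example, `imbalance_by_parity(train_labels, parity=1, keep_fraction=0.10)` would correspond to the "low representation (10%) of odd classes" setting, with `train_labels` standing for any integer label array.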

C. Convolutional Neural Network 

We use a deep CNN to learn robust feature representations
for the task of image classification. The network architecture
consists of a total of 18 weight layers (see Fig. 5 for details).
Our architecture is similar to the state-of-the-art CNN (configuration
D) proposed in [39], except that our architecture has
two extra fully connected layers before the output layer and
the proposed loss layer is cost-sensitive. Since there are a huge
number of parameters (~139 million) in the network, it is not
possible to learn all of them from scratch using a relatively
small number of images. We, therefore, initialize the first 16
layers of our model with the pre-trained model of [39] and set
random weights for the last two fully connected layers. We
then train the full network with a relatively higher learning
rate to allow a change in the network parameters. Note that the
cost-sensitive (CoSen) CNN is trained with the modified cost
functions introduced in Eqs. (6-11). The CNN trained without
the cost-sensitive loss layer will be used as the baseline CNN in
our experiments. Note that the baseline CNN architecture is
exactly the same as the CoSen CNN, except that the final layer
is a simple CE loss layer.

D. Results and Comparisons 

For the two imbalanced datasets with imbalanced protocols,
we summarize our experimental results and comparisons in
Tables I and II. For each of the two datasets, we perform two
standard experiments following the works of Beijbom et al. [?]
and Ballerini et al. [?]. In the first experiment on the DIL
Fig. 5: The CNN architecture used in this work consists of 18 weight layers. Layer groups, in order: Input layer (224x224x3); Conv. layers (2), 3x3x3 (64 each); Conv. layers (2), 3x3x3 (128 each); Conv. layers (3), 3x3x3 (256 each); Conv. layers (3), 3x3x3 (512 each); Conv. layers (3), 3x3x3 (512 each); FC layers (4), 1 x 4096 each; Output layer (soft-max loss).

Fig. 6: Confusion Matrices for the Baseline and CoSen CNNs on the DIL and MLC datasets. Results are reported for Experiments 1 and 2
for the DIL and MLC datasets, respectively. Panels: (a) DIL (Baseline-CNN); (b) DIL (CoSen-CNN); (c) MLC (Baseline-CNN); (d) MLC (CoSen-CNN).
DIL classes: Actinic Keratosis, Basal Cell Carcinoma, Melanocytic Nevus, Squamous Cell Carcinoma, Seborrhoeic Keratosis.
MLC classes: CCA, Turf, Macro, Sand, Acrop, Pavon, Monti, Pocill, Porit.


Methods (using stand. split)          Performances
Deeply Supervised Nets [?]            99.6%
Generalized Pooling Func. [40]        99.7%
Maxout NIN [41]                       99.8%

Our approach (↓)                      Baseline CNN    CoSen CNN
Stand. split (~600 trn, ~100 tst)     99.3%           99.3%
Low rep. (10%) of odd digits          97.6%           98.6%
Low rep. (10%) of even digits         97.1%           98.4%
Low rep. (25%) of odd digits          98.1%           98.9%
Low rep. (25%) of even digits         97.8%           98.5%

TABLE III: Evaluation on MNIST Database.


Methods (using stand. split)          Performances
Network in Network [?]                64.3%
Probablistic Maxout Network [42]      61.9%
Representation Learning [43]          60.8%
Deeply Supervised Nets [?]            65.4%
Generalized Pooling Func. [40]        67.6%
Maxout NIN [?]                        71.1%

Our approach (↓)                      Baseline CNN    CoSen CNN
Stand. split (500 trn, 100 tst)       65.2%           65.2%
Low rep. (10%) of odd classes         55.0%           60.1%
Low rep. (10%) of even classes        53.8%           59.8%
Low rep. (25%) of odd classes         57.7%           61.5%
Low rep. (25%) of even classes        57.4%           61.6%

TABLE IV: Evaluation on CIFAR-100 Database.

dataset, we perform 3-fold cross validation on the 5 classes
(namely Actinic Keratosis, Basal Cell Carcinoma, Melanocytic
Nevus, Squamous Cell Carcinoma and Seborrhoeic Keratosis)
comprising a total of 960 images. In the second experiment,
we perform 3-fold cross validation on all of the 10 classes in
the DIL dataset. We achieved a performance boost of ~5.0%
and ~3.1% over the baseline CNN in the first and second
experiments respectively (Table I).

For the MLC dataset, in the first experiment we train on
two-thirds of the data from 2008 and test on the remaining one
third. In the second experiment, data from year 2008 is used
for training and tests are performed on data from year 2009.
Note that in contrast to the ‘multiple texton maps’ (MTM) [?]
approach which extracts features from multiple scales, we only
extract features from the 224 x 224 dimensional patches. While
we can achieve a larger gain by using multiple scales with our
approach, we kept the setting similar to the one used with the
other datasets for consistency. For similar reasons, we used the
RGB color space instead of LAB, which was shown to perform
better on the MLC dataset [?]. Compared to the baseline CNN,
we achieved a gain of 2.3% and 2.5% on the first and second
experiments respectively. Although the gains in the overall
accuracy may seem modest, it should be noted that the boost in
the average class accuracy is more pronounced. For example,
the confusion matrices for DIL and MLC datasets in Fig. 6
(corresponding to Exp. 1 and Exp. 2 respectively), show an
improvement of 9.5% and 11.8% in the average class accuracy.
The confusion matrices in Figs. 6a, 6b, 6c and 6d also show
a very significant boost in performance for the least frequent
classes e.g., Turf, Macro, Monti, AK and SCC.
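The difference between overall accuracy and average class accuracy is easy to see from a confusion matrix. The following sketch uses an illustrative function name and assumes every class has at least one test sample.

```python
import numpy as np

def overall_and_average_class_accuracy(conf):
    """conf[i, j] = number of class-i samples predicted as class j."""
    conf = np.asarray(conf, dtype=float)
    overall = np.trace(conf) / conf.sum()            # fraction of all samples classified correctly
    per_class = np.diag(conf) / conf.sum(axis=1)     # recall of each class
    return float(overall), float(per_class.mean())
```

For a toy two-class matrix `[[90, 10], [8, 2]]`, the overall accuracy is about 0.84 while the average class accuracy is only 0.55, because the rare second class is mostly misclassified. This is why gains on minority classes show up more strongly in the average class accuracy.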


Our results for the two balanced datasets, MNIST and
CIFAR-100, are reported in Tables III and IV on the standard splits
along with the deliberately imbalanced splits. To imbalance
the training distributions, we used the available/normal training
data for the even classes and only 25% and 10% of data
for the odd classes. Similarly, we experimented by keeping
the normal representation of the odd classes and reducing the
representation of the even classes to only 25% and 10%. Our
Fig. 7: The imbalanced training set distributions used for the comparisons reported in Tables VII-IX. Note that for the DIL and the MLC datasets, these distributions are the same as the standard protocols. For the MLC dataset, only the training set distribution for the first experiment is shown here which is very similar to the training set distribution of the second experiment (best viewed when enlarged). Recoverable panels (x-axis: Distinct Classes): (a) MNIST Training Set; (d) Caltech-101 Training Set; (e) MLC Training Set; (f) DIL Training Set.


Methods (using stand. split)              15 trn. samples    30 trn. samples
Multiple Kernels [44]                     71.1 ± 0.6         78.2 ± 0.4
LLC† [45]                                 -                  76.9 ± 0.4
Imp. Fisher Kernel† [46]                  -                  77.8 ± 0.6
SPM-SC [?]                                73.2               84.3
DeCAF [?]                                 -                  86.9 ± 0.7
Zeiler & Fergus [49]                      83.8 ± 0.5         86.5 ± 0.5
Chatfield et al. [?]                      -                  88.3 ± 0.6
SPP-net [50]                              -                  91.4 ± 0.7

Our approach (↓)                          Baseline CNN       CoSen CNN
Stand. split (15 trn. samples)            87.1%              87.1%
Stand. split (30 trn. samples)            90.8%              90.8%
Org. data distribution (60%/40% split)    88.1%              89.3%
  Low rep. (10%) of odd classes           77.4%              83.2%
  Low rep. (10%) of even classes          76.1%              82.3%
Org. data distribution (30%/70% split)    85.5%              87.9%
  Low rep. (10%) of odd classes           74.6%              80.4%
  Low rep. (10%) of even classes          75.2%              80.9%

TABLE V: Evaluation on Caltech-101 Database († figures reported
in [?]).

results show that the performance of our approach is equal to
the performance of the baseline method when the distribution
is balanced, but when the imbalance ratios increase, our
approach produces significant improvements over the baseline
CNN (which is trained without using the cost-sensitive loss
layer). We also compare with the state-of-the-art techniques
which report results on the standard split¹ and demonstrate
that our performances are better or comparable. Note that
for the MNIST digit dataset, nearly all the top performing
approaches use distortions (affine and/or elastic) and data
augmentation to achieve a significant boost in performance.
In contrast, our baseline and cost-sensitive CNNs do not use

¹Note that the standard split on the Caltech-101 and MIT-67 is different
from the original data distribution (see Sec. IV-B for details).


Methods (using stand. split)              Performances
Spatial Pooling Regions [52]              50.1%
VC + VO [53]                              52.3%
CNN-SVM [?]                               58.4%
Improved Fisher Vectors [55]              60.8%
Mid Level Representation [56]             64.0%
Multiscale Orderless Pooling [57]         68.9%

Our approach (↓)                          Baseline CNN    CoSen CNN
Stand. split (80 trn, 20 tst)             70.9%           70.9%
Org. data distribution (60%/40% split)    70.7%           73.2%
  Low rep. (10%) of odd classes           50.4%           56.9%
  Low rep. (10%) of even classes          50.1%           56.4%
Org. data distribution (30%/70% split)    61.9%           66.2%
  Low rep. (10%) of odd classes           38.7%           44.7%
  Low rep. (10%) of even classes          37.2%           43.4%

TABLE VI: Evaluation on MIT-67 Database.


               F-measure                     G-mean
Dataset        Baseline CNN    CoSen CNN     Baseline CNN    CoSen CNN
MNIST          0.488           0.493         0.987           0.992
CIFAR-100      0.283           0.307         0.736           0.766
Caltech-101    0.389           0.416         0.873           0.905
MIT-67         0.266           0.302         0.725           0.772
DIL            0.343           0.358         0.789           0.813
MLC            0.314           0.338         0.635           0.723

TABLE VII: The table shows the F-measure and G-mean scores for
the baseline and cost-sensitive CNNs. The experimental protocols
used for each dataset are shown in Fig. 7. CoSen CNN consistently
outperforms the Baseline CNN on all datasets.


any form of distortions/augmentation during the training and 
testing procedures on MNIST. 

We also experiment on the two popular classification 
datasets which are originally imbalanced, and for which the 
standard protocols use an equal number of images for all 
training classes. For example, 30 or 15 images are used for 
Columns: (1) Over-sampling (SMOTE [?]); (2) Under-sampling (RUS [?]); (3) Hybrid-sampling (SMOTE-RSB* [?]); (4) CoSen SVM (WSVM [59]); (5) CoSen RF (WRF [?]); (6) SOSR CNN [30]; (7) Baseline CNN; (8) CoSen CNN.

Dataset (imbalanced protocol)   Experimental setting        (1)      (2)      (3)      (4)      (5)      (6)      (7)      (8)
MNIST                           10% of odd classes          94.5%    92.1%    96.0%    96.8%    96.3%    97.8%    97.6%    98.6%
CIFAR-100                       10% of odd classes          32.2%    28.8%    37.5%    39.9%    39.0%    55.8%    55.0%    60.1%
Caltech-101                     60% trn, 10% of odd cl.     67.7%    61.4%    68.2%    70.1%    68.7%    77.4%    77.4%    83.2%
MIT-67                          60% trn, 10% of odd cl.     33.9%    28.4%    34.0%    35.5%    35.2%    49.8%    50.4%    56.9%
DIL                             Stand. split (Exp. 2)       50.3%    46.7%    52.6%    55.3%    54.7%    68.9%    69.5%    72.6%
MLC                             Stand. split (Exp. 2)       38.9%    31.4%    43.0%    47.7%    46.5%    65.7%    66.1%    68.6%

TABLE VIII: Comparisons of our approach with the state-of-the-art class-imbalance approaches. The experimental protocols used for each
dataset are shown in Fig. 7. With highly imbalanced training sets, our approach significantly out-performs other data sampling and cost-sensitive
classifiers on all four classification datasets.


the case of Caltech-101 while 80 images per category are used
in MIT-67 for training. We report our results on the standard
splits (Tables V and VI), to compare with the state-of-the-art
approaches, and show that our results are superior to the
state-of-the-art on MIT-67 and competitive on the Caltech-101 dataset.
Note that the best-performing SPP-net [50] uses multiple sizes
of Caltech-101 images during training. In contrast, we only use
a single consistent size during training and testing. We also
experiment with the original imbalanced data distributions to
train the CNN with the modified loss function. For the original
data distributions, we use both 60%/40% and 30%/70%
train/test splits to show our performances with a variety of
train/test distributions. Moreover, with these imbalanced splits,
we further decrease the data of odd and even classes to just
10% respectively, and observe a better relative performance of
our proposed approach compared to the baseline method.


We report F-measure and G-mean scores on all the six
datasets in Table VII. The metric calculation details are
provided in Sec. IV-A. The most unbalanced splits (Fig. 7)
are used for each dataset to clearly demonstrate the benefit of
class-specific costs. We note that the cost-sensitive CNN model
clearly out-performs the baseline model for all experiments.

The comparisons with the best approaches for class-imbalance
learning are shown in Table VIII. Note that we used
a high degree of imbalance for the case of all six datasets to
clearly show the impact of the class imbalance problem on the
performance of the different approaches (Fig. 7). For fairness
and conclusive comparisons, our experimental procedure was
kept as close as possible to the proposed CoSen CNN. For
example, for the case of CoSen Support Vector Machine
(SVM) and Random Forest (RF) classifiers, we used the 4096
dimensional features extracted from the pre-trained deep CNN
(D) [39]. Similarly, for the cases of over and under-sampling,
we used the same 4096 dimensional features, which have been
shown to perform well on other classification datasets. A
two-layered neural network was used for classification with these
sampling procedures. We also report comparisons with all
types of data sampling techniques i.e., over-sampling (SMOTE
[?]), under-sampling (Random Under Sampling - RUS [?])
and hybrid sampling (SMOTE-RSB* [?]). Note that despite the
simplicity of the approaches in [15], [58], they have been shown
to perform very well on imbalanced datasets in data mining
[14], [?]. We also compare with the cost-sensitive versions
of popular classifiers (weighted SVM [59] and weighted RF
[?]). For the case of weighted SVM, we used the standard


Dataset                  CoSen-CNN         CoSen-CNN         CoSen-CNN         CoSen-CNN
(imbalanced protocol)    Fixed Cost (H)    Fixed Cost (S)    Fixed Cost (M)    Adap.
MNIST                    97.2%             97.2%             97.9%             98.6%
CIFAR-100                55.2%             55.8%             56.0%             60.1%
Caltech-101              76.2%             77.1%             77.7%             83.0%
MIT-67                   51.6%             50.9%             49.7%             57.0%
DIL                      70.0%             69.5%             69.3%             72.6%
MLC                      66.3%             66.8%             65.7%             68.6%

TABLE IX: Comparisons of our approach (adaptive costs) with the
fixed class-specific costs. The experimental protocols used for each
dataset are shown in Fig. 7. Fixed costs do not show a significant
and consistent improvement in results.


implementation of LIBSVM [62] and set the class-dependent
costs based on the proportion of each class in the training
set. Finally, we experiment with a recent cost-sensitive deep
learning based technique of Chung et al. [30]. Unlike our
approach, [30] does not automatically learn class-specific
costs. To have a fair comparison, we incorporate their proposed
smooth one-sided regression (SOSR) loss as the last layer of
the baseline CNN model in our experiments. Similar to [30],
we use the approach proposed in [63] to generate fixed cost
matrices. Our proposed approach demonstrates a significant
improvement over all of the cost-sensitive class imbalance
methods.
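One common way of deriving class-dependent costs from class proportions is inverse-frequency weighting, sketched below. This is an assumption for illustration: the text does not state the exact mapping from proportions to costs, and the function name is hypothetical.

```python
import numpy as np

def inverse_frequency_costs(labels):
    """Map each class to a weight proportional to 1 / (class frequency).

    The weights are normalised so that a perfectly balanced dataset gives 1.0
    for every class (hypothetical convention, not taken from the paper)."""
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    weights = counts.sum() / (classes.size * counts)
    return {int(c): float(w) for c, w in zip(classes, weights)}
```

Such a dictionary could then be supplied to a weighted SVM, e.g., through the per-class `-wi` weights of LIBSVM's `svm-train`, which scale the penalty parameter C for class i.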

Since our approach updates the costs with respect to the 
data statistics (i.e., data distribution, class separability and 
classification errors), an interesting aspect is to analyse the 
performance when the costs are fixed and set equal to these 
statistics instead of updating them adaptively. We experiment 
with fixed costs instead of adaptive costs in the case of CoSen- 
CNN. For this purpose, we used three versions of fixed costs, 
based on the class representation (H), data separability (S) 
and classification errors (M). Table IX shows the results for
each dataset with four different types of costs. The results 
show that none of the fixed costs significantly improve the 
performance in comparison to the adaptive cost. This shows 
that the optimal costs are not the H, S and M themselves, 
rather an intermediate set of values give the best performance 
for cost-sensitive learning. 

Lastly, we observed a smooth reduction in training and
validation error for the case of cost-sensitive CNN. We show a
comparison of classification errors between baseline and
cost-sensitive CNNs at different training epochs in Fig. 8.

Fig. 8: An observed decrease in the training and validation error on
the DIL dataset (stand. split, Exp. 2) for the cases of the baseline
and cost-sensitive CNNs.


Timing Comparisons: The introduction of the class-dependent
costs did not prove to be prohibitive during the
training of the CNN. For example, on an Intel quad core
i7-4770 CPU (3.4GHz) with 32GB RAM and Nvidia GeForce
GTX 660 card (2GB), it took 80.19 secs and 71.87 secs to
run one epoch with and without class sensitive parameters,
respectively, for the MIT-67 dataset, i.e., an overhead of
roughly 11.6% per epoch. At test time, the CoSen
CNN took the same amount of time as that of the baseline
CNN, because no extra computations were involved during
testing.


V. Conclusion 

We proposed a cost-sensitive deep CNN to deal with the 
class-imbalance problem, which is commonly encountered 
when dealing with real-world datasets. Our approach is able 
to automatically set the class-dependent costs based on the 
data statistics of the training set. We analysed three commonly 
used cost functions and introduced class-dependent costs for 
each case. We show that the cost-sensitive CE loss function 
is c-calibrated and guess aversive. Furthermore, we proposed 
an alternating optimization procedure to efficiently learn the 
class-dependent costs as well as the network parameters. Our 
results on six popular classification datasets show that the 
modified cost functions perform very well on the majority as 
well as on the minority classes in the dataset. 

Acknowledgment 

This research was supported by an IPRS awarded by The University
of Western Australia and Australian Research Council (ARC)
grants DP150100294 and DE120102960.


Appendix A

Proofs Regarding Cost Matrix

Lemma A.1. Offsetting the columns of the cost matrix by any
constant ‘c’ does not affect the associated classification risk $\mathcal{R}$.

Proof: From Eq. 1, we have:

$$\sum_q \xi'_{p^*,q} P(q|x) \le \sum_q \xi'_{p,q} P(q|x), \quad \forall p \neq p^*$$

which gives the following relation:

$$P(p^*|x)\,(\xi'_{p^*,p^*} - \xi'_{p,p^*}) \le \sum_{q \neq p^*} P(q|x)\,(\xi'_{p,q} - \xi'_{p^*,q}), \quad \forall p \neq p^*$$

As indicated in Sec. 3.1, the above expression holds for all $p \neq p^*$.
For a total number of $N$ classes and an optimal prediction $p^*$, there
are $N - 1$ of the above relations. By adding up the left and the right
hand sides of these $N - 1$ relations we get:

$$P(p^*|x)\Big((N-1)\,\xi'_{p^*,p^*} - \sum_{p \neq p^*} \xi'_{p,p^*}\Big) \le \sum_{q \neq p^*} P(q|x)\Big(\sum_{p \neq p^*} \xi'_{p,q} - (N-1)\,\xi'_{p^*,q}\Big).$$

This can be simplified to:

$$P_x \Big[\sum_i \xi'_{i,1} - N\xi'_{p^*,1},\ \ldots,\ \sum_i \xi'_{i,N} - N\xi'_{p^*,N}\Big]^T \ge 0,$$

where $P_x = [P(1|x), \ldots, P(N|x)]$. Note that the posterior probabilities
$P_x$ are positive ($\sum_i P(i|x) = 1$ and $P(i|x) > 0$). It can
be seen from the above equation that the addition of any constant $c$
does not affect the overall relation, i.e., for any column $j$,

$$\sum_i (\xi'_{i,j} + c) - N(\xi'_{p^*,j} + c) = \sum_i \xi'_{i,j} - N\xi'_{p^*,j}.$$

Therefore, the columns of the cost matrix can be shifted by a constant
$c$ without any effect on the associated risk. ■

Lemma A.2. The cost of the true class should be less than the mean
cost of all misclassification.

Proof: Since $P_x$ can take any distribution of values, we end up
with the following constraint:

$$\sum_i \xi'_{i,j} - N\xi'_{p^*,j} \ge 0, \quad j \in [1, N].$$

For a correct prediction $p^*$, $P(p^*|x) > P(p|x), \forall p \neq p^*$. Which
implies that:

$$\xi'_{p^*,p^*} \le \frac{1}{N}\sum_i \xi'_{i,p^*}.$$

It can be seen that the cost insensitive matrix (when $\mathrm{diag}(\xi') = 0$
and $\xi'_{i,j} = 1, \forall j \neq i$) satisfies this relation and provides the upper
bound. ■

Lemma A.3. The cost matrix $\xi$ for a cost-insensitive loss function
is an all-ones matrix, rather than a $\mathbf{1} - \mathbf{I}$ matrix, as in the
case of the traditionally used cost matrix $\xi'$.

Proof: With all costs equal to the multiplicative identity i.e.,
$\xi_{p,q} = 1$, the CNN activations will remain unchanged. Therefore,
all decisions have a uniform cost of 1 and the classifier is
cost-insensitive. ■


Lemma A.4. All costs in $\xi$ are positive, i.e., $\xi > 0$.

Proof: We adopt a proof by contradiction. Let us suppose that
$\xi_{p,q} = 0$. During training in this case, the corresponding score for
class $q$ ($s_{p,q}$) will always be zero for all samples belonging to class $p$.
As a result, the output activation ($y_q$) and the back-propagated error
will be independent of the weight parameters of the network, which
proves the Lemma. ■

Lemma A.5. The cost matrix $\xi$ is defined such that all of its elements
are within the range (0,1], i.e., $\xi_{p,q} \in (0,1]$.

Proof: Based on Lemmas A.3 and A.4 it is trivial that the costs
are within the range (0,1]. ■

Lemma A.6. Offsetting the columns of the cost matrix $\xi$ can lead
to an equally probable guess point.

Proof: Let us consider the case of a cost-insensitive loss function.
In this case, $\xi = \mathbf{1}$ (from Lemma A.3). Offsetting all of its
columns by a constant $c = 1$ will lead to $\xi = \mathbf{0}$. For $\xi = \mathbf{0}$, the
CNN outputs will be zero for any $\mathbf{o} \in \mathbb{R}^N$. Therefore, the classifier
will make a random guess for classification. ■
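The column-offset invariance of Lemma A.1 for the classical cost matrix can be illustrated numerically. The sketch assumes the risk of Eq. 1 has the form $\sum_q \xi'_{p,q} P(q|x)$, as used in the proof of Lemma A.1; the function name is illustrative.

```python
import numpy as np

def bayes_decision(cost, posterior):
    """Return the prediction p that minimises sum_q cost[p, q] * posterior[q]."""
    risk = np.asarray(cost, dtype=float) @ np.asarray(posterior, dtype=float)
    return int(np.argmin(risk))
```

Adding a constant $c_j$ to column $j$ raises the risk of every candidate prediction by the same amount $c_j P(j|x)$, so the minimiser is unchanged. This holds for the classical matrix $\xi'$ only; Lemma A.6 shows that the modified matrix $\xi$ behaves differently.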